Take Home Exercise 3

Author

Georgia Ng

Published

October 17, 2024

Modified

November 3, 2024

1. Overview

1.1 Introduction

Recent MRT service breakdowns in Singapore have highlighted the importance of transportation accessibility for housing. This situation, combined with the rising influx of expats seeking rental housing, has intensified competition for well-located flats. As access to reliable public transport becomes a pivotal factor for potential renters, understanding its influence on rental prices is more critical than ever.

With this project, we aim to provide individuals looking to rent a flat with a better understanding of how transportation accessibility and other local factors affect rental prices. By leveraging Geographically Weighted Regression (GWR), our analysis will help prospective renters make informed decisions, ultimately improving their housing choices in Singapore’s dynamic urban landscape.

1.2 My Responsibilities

  • Data Preparation and Preprocessing
  • Exploratory Data Analysis (EDA)

1.3 Importing Packages

Here, we have loaded the following packages:

pacman::p_load(sf, sfdep, tmap, tidyverse, RColorBrewer, ggplot2, spatstat, jsonlite, units, matrixStats, httr)
  • sf: Provides tools for handling and analyzing spatial vector data using simple features, facilitating operations like spatial joins, buffering, and transformations.

  • sfdep: Adds spatial dependency features for the sf package, allowing for the analysis of spatial autocorrelation and other spatial dependencies.

  • tmap: Used for creating and visualizing thematic maps, supporting both static and interactive mapping with various customization options.

  • tidyverse: A collection of R packages designed for data science, including tools for data manipulation (dplyr), visualization (ggplot2), and tidying data (tidyr).

  • RColorBrewer: Provides color palettes that are colorblind-friendly and designed for creating visually appealing maps and charts, suitable for categorical, sequential, and diverging data.

  • ggplot2: A powerful system for creating static graphics in R, based on the grammar of graphics, allowing for highly customizable visualizations.

  • spatstat: Offers tools for analyzing spatial point patterns, including functions for point process modeling, density estimation, and spatial statistics.

  • jsonlite: Simplifies the process of working with JSON data in R, enabling easy conversion between JSON and R objects for data interchange and web APIs.

  • units: Provides support for unit conversions in R, allowing for handling and manipulation of quantities with associated units (e.g., meters, feet) in a consistent way.

  • matrixStats: Offers highly optimized functions for computing statistics on matrices, particularly useful for large datasets, enabling efficient calculations for row and column operations.

  • httr: Simplifies working with HTTP requests in R, providing functions for sending requests, handling responses, and working with REST APIs.

2. The Data

For this project, we will be using the following data sets.

  • Singapore Rental Flat Prices (Jan 2017 to Sep 2024) from data.gov.sg; we will only be using the data from 2024

  • Master Plan 2014 Subzone Boundary (Web) from data.gov.sg

  • Hawker Centres Dataset from data.gov.sg

  • Kindergarten, Childcare, Primary School Datasets from OneMap API

  • Bus Stop Locations and MRT/LRT Locations from LTA DataMall

  • Shopping mall names scraped from Wikipedia, with their coordinates retrieved through the OneMap API

2.1 Importing Geospatial Data

2.1.1 Importing Singapore Subzone Boundaries

The code chunk below is used to import the MP14_SUBZONE_WEB_PL shapefile using st_read() of the sf package.

mpsz_sf <- st_read(dsn = "data/geospatial/MasterPlan2014SubzoneBoundaryWebSHP/", layer = "MP14_SUBZONE_WEB_PL") %>%
  st_transform(crs = 3414)
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `/Users/georgiaxng/georgiaxng/is415-handson/Take-home_Ex/Take-home_Ex03/data/geospatial/MasterPlan2014SubzoneBoundaryWebSHP' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21

Using st_crs(), we can check the coordinate reference system.

st_crs(mpsz_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

2.1.1.1 Checking Validity of Geometries

Using st_is_valid(), we can check whether all the polygons are valid. From the results, we can see that a total of 9 geometries are invalid.

# checks for the number of geometries that are invalid
length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 9

To rectify this, we can use st_make_valid() to correct these invalid geometries as demonstrated in the code chunk below.

mpsz_sf <- st_make_valid(mpsz_sf)
length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 0

2.1.1.2 Grouping By Town

Since our data is categorised by towns, it will be useful to group the geometries by town as well, so that we can later use these town-level boundaries for mapping and analysis.

# Combine geometries by town in mpsz_sf
combined_town_mpsz_sf <- mpsz_sf %>%
  group_by(PLN_AREA_N) %>%  # Group by town identifier
  summarize(geometry = st_union(geometry), .groups = 'drop')

write_rds(combined_town_mpsz_sf, 'data/rds/mpsz_grped_by_town.rds')

2.1.2 Importing Kindergartens

This chunk of code imports the kindergartens data.

kindergarten_json <- fromJSON("data/geospatial/kindergartens.json")

kindergarten_cleaned <- kindergarten_json$SrchResults[-1, ]

kindergarten_df <- data.frame(
  NAME = kindergarten_cleaned$NAME,
  latitude = sapply(kindergarten_cleaned$LatLng, function(x) as.numeric(unlist(strsplit(x, ","))[1])),
  longitude = sapply(kindergarten_cleaned$LatLng, function(x) as.numeric(unlist(strsplit(x, ","))[2]))
)

kindergarten_sf <- kindergarten_df %>%
  st_as_sf(coords = c("longitude", "latitude"), crs=4326) %>%
  st_transform(crs = 3414)
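The LatLng field arrives as a single comma-separated "lat,lng" string, which the strsplit()/sapply() pattern above pulls apart. A minimal base-R sketch of that parsing step, using a made-up coordinate string:

```r
# Hypothetical LatLng string in the same "lat,lng" shape as the OneMap export
latlng <- "1.3521,103.8198"

# Split on the comma and convert each piece to numeric
parts <- as.numeric(unlist(strsplit(latlng, ",")))
latitude  <- parts[1]   # 1.3521
longitude <- parts[2]   # 103.8198
```

Note that the latitude comes first in the string, so the order of the coords argument in st_as_sf() (longitude, then latitude) must be the reverse of the parsed order.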

2.1.3 Importing Childcare

childcare_json <- fromJSON("data/geospatial/childcare.json")

childcare_cleaned <- childcare_json$SrchResults[-1, ]

childcare_df <- data.frame(
  NAME = childcare_cleaned$NAME,
  latitude = sapply(childcare_cleaned$LatLng, function(x) as.numeric(unlist(strsplit(x, ","))[1])),
  longitude = sapply(childcare_cleaned$LatLng, function(x) as.numeric(unlist(strsplit(x, ","))[2]))
)

childcare_sf <- childcare_df %>%
  st_as_sf(coords = c("longitude", "latitude"), crs=4326) %>%
  st_transform(crs = 3414)

2.1.4 Importing Hawker Centre

Similarly, we will use st_read() to read the GeoJSON file. However, since the column values are embedded in HTML markup (a table of <th> and <td> cells), we will need a function that applies a regular expression to extract the names of the hawker centres.

hawker_sf <- st_read('data/geospatial/HawkerCentresGEOJSON.geojson')
Reading layer `HawkerCentresGEOJSON' from data source 
  `/Users/georgiaxng/georgiaxng/is415-handson/Take-home_Ex/Take-home_Ex03/data/geospatial/HawkerCentresGEOJSON.geojson' 
  using driver `GeoJSON'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
# Function to extract name from description
extract_name <- function(description) {
  if (!is.na(description)) {
    # Use regular expression to extract the NAME 
    name <- sub(".*<th>NAME</th> <td>(.*?)</td>.*", "\\1", description)
    if (name == description) {
      return(NA)  # Return NA if no match is found
    }
    return(name)
  } else {
    return(NA) 
  }
}

# Apply the extraction function to every row
hawker_sf <- hawker_sf %>%
  mutate(Name = sapply(Description, extract_name)) %>%
  select(-Description)

Here, we can see that the hawker centres are now appropriately named.

Simple feature collection with 6 features and 1 field
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.7798 ymin: 1.284425 xmax: 103.9048 ymax: 1.449017
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
                             Name                      geometry
1     Market Street Hawker Centre POINT Z (103.8502 1.284425 0)
2    Marsiling Mall Hawker Centre POINT Z (103.7798 1.433539 0)
3    Margaret Drive Hawker Centre POINT Z (103.8047 1.297486 0)
4 Fernvale Hawker Centre & Market POINT Z (103.8771 1.391592 0)
5       One Punggol Hawker Centre  POINT Z (103.9048 1.40819 0)
6    Bukit Canberra Hawker Centre POINT Z (103.8225 1.449017 0)
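To see what extract_name() is doing, the same pattern can be run against a toy description string (hypothetical markup, mimicking the <th>NAME</th> <td>…</td> shape of the real Description column):

```r
# Toy HTML fragment in the same structure as the Description column
desc <- "<tr> <th>NAME</th> <td>Maxwell Food Centre</td> </tr>"

# Same pattern as extract_name(): capture the text between the NAME cells
sub(".*<th>NAME</th> <td>(.*?)</td>.*", "\\1", desc)
# → "Maxwell Food Centre"
```

If the pattern finds no match, sub() returns the input unchanged, which is why extract_name() compares the result against the original string and returns NA in that case.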

As shown above, the geographic coordinate system for the hawker dataset is WGS 84 with XYZ coordinates, including a Z-coordinate we do not need. We can therefore use st_zm() to drop the Z-coordinate and project the data to the SVY21 coordinate system using st_transform().

hawker_sf <- st_zm(hawker_sf) %>%
  st_transform(crs = 3414)

head(hawker_sf)
Simple feature collection with 6 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 22042.51 ymin: 29650.7 xmax: 35955.52 ymax: 47850.43
Projected CRS: SVY21 / Singapore TM
                             Name                  geometry
1     Market Street Hawker Centre  POINT (29874.82 29650.7)
2    Marsiling Mall Hawker Centre POINT (22042.51 46139.03)
3    Margaret Drive Hawker Centre  POINT (24816.7 31094.91)
4 Fernvale Hawker Centre & Market  POINT (32867.9 41500.77)
5       One Punggol Hawker Centre POINT (35955.52 43336.13)
6    Bukit Canberra Hawker Centre POINT (26794.39 47850.43)

2.1.5 Importing Bus Stops

Here we are importing the bus stop locations using st_read and also converting it to the SVY21 coordinate system.

busstop_sf <- st_read(dsn = "data/geospatial/BusStopLocation_Jul2024/", layer = "BusStop")%>%
  st_transform(crs = 3414)
Reading layer `BusStop' from data source 
  `/Users/georgiaxng/georgiaxng/is415-handson/Take-home_Ex/Take-home_Ex03/data/geospatial/BusStopLocation_Jul2024' 
  using driver `ESRI Shapefile'
Simple feature collection with 5166 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48285.52 ymax: 52983.82
Projected CRS: SVY21

2.1.6 Importing Shopping Malls

Here we are importing the shopping mall locations using read_csv and also converting it to the SVY21 coordinate system.

shoppingmall_sf <- read_csv('data/geospatial/shopping_mall_coordinates.csv') %>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs=4326) %>%
  st_transform(crs = 3414)

2.1.7 Importing MRT

Here we are importing the MRT station locations using st_read().

mrt_sf <- st_read(dsn = "data/geospatial/TrainStation_Jul2024/", layer = "RapidTransitSystemStation")
Reading layer `RapidTransitSystemStation' from data source 
  `/Users/georgiaxng/georgiaxng/is415-handson/Take-home_Ex/Take-home_Ex03/data/geospatial/TrainStation_Jul2024' 
  using driver `ESRI Shapefile'
Simple feature collection with 230 features and 5 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21

Having imported the dataset, we need to check for invalid geometries and NA values before proceeding. The code chunk below identifies the rows whose validity check returns NA and removes them; the final printed result confirms that no NA validity results remain.

# Check geometry validity, returning the reason for any failures
validity_checks <- st_is_valid(mrt_sf, reason = TRUE)

# Identify indices where the validity check returned NA
na_indices <- which(is.na(validity_checks))

# Drop rows with NA validity results, guarding against an empty index
# (subsetting with an empty negative index would drop every row)
if (length(na_indices) > 0) {
  mrt_sf <- mrt_sf[-na_indices, ]
}

# Verify the mrt object no longer contains NA validity results
any(is.na(sf::st_is_valid(mrt_sf)))
[1] FALSE
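A subtle base-R gotcha to keep in mind when removing rows by index: if the index vector is empty, df[-idx, ] drops every row rather than none, because the negation of an empty index is still an empty index. A small demonstration:

```r
df <- data.frame(x = 1:3)

# which() on an all-FALSE condition returns integer(0)
idx <- which(is.na(df$x))

# Subsetting with -integer(0) selects zero rows, not all rows
nrow(df[-idx, , drop = FALSE])   # 0, not 3

# Guarding keeps the data intact when there is nothing to drop
if (length(idx) > 0) df <- df[-idx, ]
nrow(df)                         # 3
```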

Here we use st_transform() to convert it to the SVY21 coordinate system (EPSG: 3414).

mrt_sf <- mrt_sf %>%
  st_transform(crs = 3414)

2.1.8 Importing Primary School

This code chunk imports the primary school dataset from data.gov.sg and uses select() to keep the relevant columns by their column numbers.

primarysch_df = read_csv('data/geospatial/Generalinformationofschools.csv') %>% filter(mainlevel_code =='PRIMARY') %>% select(1,3,4)

2.1.8.1 Geocoding Primary School Data using OneMap API

Since this dataset only has addresses and not coordinates, we will need to use the OneMap API to geocode them. The code chunk below defines a function that calls the OneMap search API and returns the latitude and longitude of the address supplied.

geocode <- function(address, postal) {
  base_url <- "https://www.onemap.gov.sg/api/common/elastic/search"
  query <- list("searchVal" = address,
                "postal" = postal,
                "returnGeom" = "Y",
                "getAddrDetails" = "N",
                "pageNum" = "1")
  
  res <- GET(base_url, query = query)
  restext<-content(res, as="text")
  
  output <- fromJSON(restext)  %>% 
    as.data.frame %>%
    select(results.LATITUDE, results.LONGITUDE)

  return(output)
}

This code chunk creates the LATITUDE and LONGITUDE columns with default values of 0, then loops through every row of the primary school dataset, calling the function above to populate the latitude and longitude for each row.

primarysch_df$LATITUDE <- 0
primarysch_df$LONGITUDE <- 0

for (i in 1:nrow(primarysch_df)){
  temp_output <- geocode(primarysch_df[i, 2], primarysch_df[i, 3])
  print(i)
  
  primarysch_df$LATITUDE[i] <- temp_output$results.LATITUDE
  primarysch_df$LONGITUDE[i] <- temp_output$results.LONGITUDE
}
write_rds(primarysch_df, 'data/rds/geocoded_primarysch.rds')

Using glimpse(), we can inspect the structure of the primary school dataset.

glimpse(primarysch_df)
Rows: 178
Columns: 3
$ school_name <chr> "ADMIRALTY PRIMARY SCHOOL", "AHMAD IBRAHIM PRIMARY SCHOOL"…
$ address     <chr> "11   WOODLANDS CIRCLE", "10   YISHUN STREET 11", "100  Br…
$ postal_code <dbl> 738907, 768643, 579646, 159016, 544969, 569785, 569920, 22…

Using read_rds(), we can load the already processed and geocoded data without running the geocoding function again. Since the data is in the WGS 84 coordinate system, we use st_as_sf() to create the geometry and st_transform() to project it to the SVY21 coordinate system we will be using.

primarysch_df <- read_rds('data/rds/geocoded_primarysch.rds')
primarysch_sf <- primarysch_df %>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs=4326) %>%
  st_transform(crs = 3414)

2.1.9 Inferring CBD

Finally, let us factor in proximity to the Central Business District, located in the Downtown Core. For this, we take the coordinates of the Downtown Core as the coordinates of the CBD:

lat <- 1.287953
lng <- 103.851784

cbd_sf <- data.frame(lat, lng) %>%
  st_as_sf(coords = c("lng", "lat"), crs=4326) %>%
  st_transform(crs=3414)

2.1.10 Writing Into RDS

To later use these layers for visualisation in our Shiny app, we can use write_rds() to store them.

write_rds(kindergarten_sf, 'data/rds/geospatial/kindergarten_sf.rds')
write_rds(childcare_sf, 'data/rds/geospatial/childcare_sf.rds')
write_rds(hawker_sf, 'data/rds/geospatial/hawker_sf.rds')
write_rds(busstop_sf, 'data/rds/geospatial/busstop_sf.rds')
write_rds(shoppingmall_sf, 'data/rds/geospatial/shoppingmall_sf.rds')
write_rds(mrt_sf, 'data/rds/geospatial/mrt_sf.rds')
write_rds(primarysch_sf, 'data/rds/geospatial/primarysch_sf.rds')
write_rds(cbd_sf, 'data/rds/geospatial/cbd_sf.rds')

2.2 Importing Aspatial Data

2.2.1 Importing Rental Flat

The code chunk below is used to import the rental data from data.gov.sg.

rental_df = read_csv('data/aspatial/RentingOutofFlats2024CSV.csv')

To get a brief overview of the existing columns of this dataset, we can use colnames().

colnames(rental_df)
[1] "rent_approval_date" "town"               "block"             
[4] "street_name"        "flat_type"          "monthly_rent"      

2.2.1.1 Converting rent_approval_date to a Valid Date Format

Since rent_approval_date is in chr format, we want to convert it to a date format so that we can later access and use this variable more easily. This is done with ym() from lubridate, as shown in the code chunk below.

rental_df$rent_approval_date <- ym(rental_df$rent_approval_date)
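ym() parses a "YYYY-MM" string into a Date pinned to the first day of that month. For intuition, a base-R equivalent (assuming the "YYYY-MM" format used in this dataset):

```r
# Base-R equivalent of lubridate::ym(): append a day and parse
rent_month <- "2024-03"
as.Date(paste0(rent_month, "-01"))   # "2024-03-01"
```

Once the column is a proper Date, functions like year() can be used to filter it, as in the next step.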

2.2.1.2 Filtering For 2024

Since the dataset is rather large, we will narrow our scope and focus only on the 2024 data, which in this case spans Jan 2024 to Sep 2024.

rental_df <- rental_df %>%
  filter(year(rent_approval_date) == 2024)

2.2.1.3 Geocoding Rental Flat Data Using OneMap API

Like the primary school data, this dataset has addresses but no coordinates, so we will need to go through the same geocoding process as above. The geocoding function:

geocode <- function(block, streetname) {
  base_url <- "https://www.onemap.gov.sg/api/common/elastic/search"
  address <- paste(block, streetname, sep = " ")
  query <- list("searchVal" = address, 
                "returnGeom" = "Y",
                "getAddrDetails" = "N",
                "pageNum" = "1")
  
  res <- GET(base_url, query = query)
  restext<-content(res, as="text")
  
  output <- fromJSON(restext)  %>% 
    as.data.frame %>%
    select(results.LATITUDE, results.LONGITUDE)

  return(output)
}

This code chunk then calls the function above for every row of rental_df and writes the result to rds.

rental_df$LATITUDE <- 0
rental_df$LONGITUDE <- 0

for (i in 1:nrow(rental_df)){
  temp_output <- geocode(rental_df[i, 3], rental_df[i, 4])
  print(i)
  
  rental_df$LATITUDE[i] <- temp_output$results.LATITUDE
  rental_df$LONGITUDE[i] <- temp_output$results.LONGITUDE
}
write_rds(rental_df, 'data/rds/geocoded_rental_2024.rds')

To avoid re-running the time-consuming geocoding above, we can simply read the data back from rds.

rental_df <- read_rds('data/rds/geocoded_rental_2024.rds')

2.2.1.4 CRS Adjustments

Another important step after importing a dataset is checking the coordinate system used. As seen in the st_crs() output below, no CRS is set for rental_df.

st_crs(rental_df)
Coordinate Reference System: NA

Therefore, we need to convert the longitude and latitude columns into a spatial format. We first use st_as_sf() to create the geometry column (as WGS 84 coordinates), then use st_transform() to project it to the SVY21 coordinate reference system (EPSG: 3414), since our dataset is based in Singapore.

rental_sf <- rental_df %>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs=4326) %>%
  st_transform(crs = 3414)

Using st_crs(), we can check and verify that the conversion is successful.

st_crs(rental_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Simple feature collection with 6 features and 6 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 15786.61 ymin: 30769.77 xmax: 39668.39 ymax: 45634.94
Projected CRS: SVY21 / Singapore TM
# A tibble: 6 × 7
  rent_approval_date town            block street_name    flat_type monthly_rent
  <date>             <chr>           <chr> <chr>          <chr>            <dbl>
1 2024-01-01         YISHUN          386   YISHUN RING RD 4-ROOM            3700
2 2024-01-01         JURONG WEST     140B  CORPORATION DR 4-ROOM            3900
3 2024-01-01         SENGKANG        471B  FERNVALE ST    5-ROOM            3700
4 2024-01-01         KALLANG/WHAMPOA 10    GLOUCESTER RD  3-ROOM            3600
5 2024-01-01         BEDOK           31    BEDOK STH AVE… 5-ROOM            4350
6 2024-01-01         QUEENSTOWN      82    STRATHMORE AVE 4-ROOM            3000
# ℹ 1 more variable: geometry <POINT [m]>

2.2.1.5 Checking for NA values

This code chunk checks every column of the dataset for NA values. As shown below, there are none.

rental_sf %>%
  summarise(across(everything(), ~ sum(is.na(.)))) -> extra_NA 
extra_NA
Simple feature collection with 1 feature and 6 fields
Geometry type: MULTIPOINT
Dimension:     XY
Bounding box:  xmin: 11519.15 ymin: 28097.64 xmax: 45192.3 ymax: 48741.06
Projected CRS: SVY21 / Singapore TM
# A tibble: 1 × 7
  rent_approval_date  town block street_name flat_type monthly_rent
               <int> <int> <int>       <int>     <int>        <int>
1                  0     0     0           0         0            0
# ℹ 1 more variable: geometry <MULTIPOINT [m]>

3. Data Wrangling

3.1 Removal of Redundant Columns

To improve efficiency and reduce the data size, we can remove columns we no longer need, such as block and street_name, which were only required for geocoding and now serve no further purpose.

# Define columns to be removed
columns_to_remove <- c("block","street_name")

# Remove columns only if they exist in the dataframe
rental_sf <- rental_sf %>%
  dplyr::select(-all_of(columns_to_remove[columns_to_remove %in% names(rental_sf)]))

3.2 Filter By Flat Type

Let us get an overview of the distribution of flat types. As shown in the bar chart, there is significantly less data for flat types like 1-room, 2-room, and executive flats.

# Create a summary of counts for each flat type
count_data <- rental_sf %>%
  group_by(flat_type) %>%
  summarise(count = n())

# Create the bar plot with labels
ggplot(count_data, aes(x = flat_type, y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") + 
  geom_text(aes(label = count), vjust = -0.5, size = 4) +  # Add labels on top of the bars
  labs(title = "Count of Flat Type",
       x = "Flat Type",
       y = "Count") +
  theme_minimal()

Hence, we will focus on analyzing the 3-room, 4-room, and 5-room flats since they show a more substantial presence in the dataset compared to smaller flat types.

rental_sf <- rental_sf %>% filter(flat_type %in% c('3-ROOM', '4-ROOM', '5-ROOM'))

3.3 Adding Region to Rental Data

This code chunk performs a left join with mpsz_sf to categorise the flats into regions, in order to better understand rental trends.

3.3.1 Left Joining with mpsz_sf

# Perform the left join by dropping the geometry from 'mpsz_sf' and bringing in only 'REGION_N'
rental_sf <- rental_sf %>%
  left_join(st_drop_geometry(mpsz_sf) %>% select(PLN_AREA_N, REGION_N) %>% distinct(PLN_AREA_N, .keep_all = TRUE), 
            by = c("town" = "PLN_AREA_N"))

3.3.2 Identifying Rows with NA values

Next, let us check whether any rows have NA values in the newly created column and display them. As shown here, there are multiple rows whose town value could not be matched to the PLN_AREA_N column of mpsz_sf.

rental_sf_with_na <- rental_sf %>%
  filter(is.na(REGION_N))

rental_sf_with_na
Simple feature collection with 1471 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 19536.43 ymin: 28634.73 xmax: 33665.81 ymax: 41493.47
Projected CRS: SVY21 / Singapore TM
# A tibble: 1,471 × 6
   rent_approval_date town      flat_type monthly_rent            geometry
 * <date>             <chr>     <chr>            <dbl>         <POINT [m]>
 1 2024-01-01         KALLANG/… 3-ROOM            3600 (30102.41 32911.75)
 2 2024-01-01         KALLANG/… 3-ROOM            2300 (19536.43 41493.47)
 3 2024-01-01         CENTRAL   3-ROOM            3000 (29435.92 29669.46)
 4 2024-01-01         KALLANG/… 3-ROOM            1850 (19536.43 41493.47)
 5 2024-01-01         KALLANG/… 4-ROOM            3400 (31184.91 34078.85)
 6 2024-01-01         KALLANG/… 5-ROOM            4100  (31013.62 33175.3)
 7 2024-01-01         KALLANG/… 3-ROOM            3200 (31618.57 33708.83)
 8 2024-01-01         KALLANG/… 3-ROOM            2400 (31228.06 33432.22)
 9 2024-01-01         CENTRAL   3-ROOM            2850 (29067.24 29360.52)
10 2024-01-01         KALLANG/… 3-ROOM            2700 (31547.37 31835.93)
# ℹ 1,461 more rows
# ℹ 1 more variable: REGION_N <chr>

Using unique(), we can identify the town values of these problematic rows, along with the available regions in mpsz_sf, to get an idea of the possible values we can assign later. In particular, the problematic values are 'KALLANG/WHAMPOA' and 'CENTRAL'.

unique(rental_sf_with_na$town)
[1] "KALLANG/WHAMPOA" "CENTRAL"        
unique(mpsz_sf$REGION_N)
[1] "CENTRAL REGION"    "WEST REGION"       "EAST REGION"      
[4] "NORTH-EAST REGION" "NORTH REGION"     

Since the value is KALLANG/WHAMPOA, let us try to find the region of either Kallang or Whampoa using the filter() function.

test <- mpsz_sf%>% filter(PLN_AREA_N == 'KALLANG' | PLN_AREA_N == 'WHAMPOA') %>% select(PLN_AREA_N,REGION_N)
test
Simple feature collection with 9 features and 2 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 29224.85 ymin: 30694.74 xmax: 33809.38 ymax: 34738.73
Projected CRS: SVY21 / Singapore TM
  PLN_AREA_N       REGION_N                       geometry
1    KALLANG CENTRAL REGION POLYGON ((31632.98 30741.73...
2    KALLANG CENTRAL REGION POLYGON ((31915.99 31851.24...
3    KALLANG CENTRAL REGION POLYGON ((31494.25 32088.46...
4    KALLANG CENTRAL REGION POLYGON ((32710.73 33608.07...
5    KALLANG CENTRAL REGION POLYGON ((31277.37 34723.29...
6    KALLANG CENTRAL REGION POLYGON ((31389.56 32098.17...
7    KALLANG CENTRAL REGION POLYGON ((30344.12 32879.39...
8    KALLANG CENTRAL REGION POLYGON ((32160.87 32549.12...
9    KALLANG CENTRAL REGION POLYGON ((31437.02 32345.27...

While there is no match for Whampoa, we can see that Kallang falls under the Central Region. From the naming, we can also deduce that the town 'CENTRAL' likely falls under the same region. Thus, using an if_else() statement, we can assign CENTRAL REGION to both towns.

rental_sf <- rental_sf %>%
  mutate(REGION_N = if_else(town == 'CENTRAL' | town == 'KALLANG/WHAMPOA', 'CENTRAL REGION', REGION_N))

Let us also rename the column to standardise the naming.

rental_sf <- rental_sf %>% rename(region = REGION_N)

3.4 Calculate Number of Facilities Within A Certain Distance & Proximity To Nearest Facility

Since the number of facilities within range and the proximity to the nearest facility are among the most important factors behind rental prices, it is important to include them in our analysis. The function below makes these calculations by comparing the locations in each imported facility dataset against the individual rental flats.

Note

Note: the calculateNumberOfFacilities parameter indicates whether the count of facilities within the radius should be calculated for a given facility type.

calculate_facilities_and_proximity <- function(dataset1, dataset2, name_of_col_facilities, name_of_col_proximity, radius, calculateNumberOfFacilities) {
  # Calculate distance matrix
  dist_matrix <- st_distance(dataset1, dataset2) %>%
    drop_units()
  
  if (calculateNumberOfFacilities){
  # Calculate the number of facilities within the specified radius
    dataset1[[name_of_col_facilities]] <- rowSums(dist_matrix <= radius)
  }
  # Calculate the proximity to the nearest facility
  dataset1[[name_of_col_proximity]] <- rowMins(dist_matrix)
  
  return(dataset1)
}
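The heart of this function is the distance matrix: one row per rental flat, one column per facility. Counting the entries at or below the radius in each row gives the facility count, and the row minimum gives the proximity. A toy illustration with made-up distances (plain base R here; matrixStats::rowMins performs the row-minimum step more efficiently on large matrices):

```r
# Rows = 2 flats, columns = 3 facilities; distances in metres (made up)
dist_matrix <- rbind(c(100, 600, 300),
                     c(700,  50, 900))

# Facilities within a 500 m radius of each flat
rowSums(dist_matrix <= 500)      # 2 1

# Distance to the nearest facility for each flat
apply(dist_matrix, 1, min)       # 100 50
```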

The code chunk below calls calculate_facilities_and_proximity() with the parameters appropriate for each facility. We excluded the MRT, primary school, and CBD layers from the within-radius counts, since for these the distance to the nearest one matters far more than the count; they are spread out enough that the count within range is usually at most one.

rental_sf <- 
  calculate_facilities_and_proximity(
    rental_sf, kindergarten_sf, "no_of_kindergarten_500m", "prox_kindergarten", 500, TRUE
  ) %>%
  calculate_facilities_and_proximity(
    ., childcare_sf, "no_of_childcare_500m", "prox_childcare", 500, TRUE
  ) %>%
  calculate_facilities_and_proximity(
    ., hawker_sf, "no_of_hawker_500m", "prox_hawker", 500, TRUE
  ) %>%
  calculate_facilities_and_proximity(
    ., busstop_sf, "no_of_busstop_500m", "prox_busstop", 500, TRUE
  ) %>%
  calculate_facilities_and_proximity(
    ., shoppingmall_sf, "no_of_shoppingmall_1km", "prox_shoppingmall", 1000, TRUE
  ) %>% 
  calculate_facilities_and_proximity(
    ., mrt_sf, "x", "prox_mrt", 1000, FALSE
  ) %>%
  calculate_facilities_and_proximity(
    ., primarysch_sf, "x", "prox_prisch", 1000, FALSE
  ) %>%
  calculate_facilities_and_proximity(
    ., cbd_sf, "x", "prox_cbd", 1000, FALSE
  )


# Writing to RDS
write_rds(rental_sf,'data/rds/rental_sf.rds')

Likewise, to skip the time-consuming process above, we can instead read the rds data using the code below.

rental_sf <- read_rds('data/rds/rental_sf.rds')
glimpse(rental_sf)
Rows: 25,713
Columns: 19
$ rent_approval_date      <date> 2024-01-01, 2024-01-01, 2024-01-01, 2024-01-0…
$ town                    <chr> "YISHUN", "JURONG WEST", "SENGKANG", "KALLANG/…
$ flat_type               <chr> "4-ROOM", "4-ROOM", "5-ROOM", "3-ROOM", "5-ROO…
$ monthly_rent            <dbl> 3700, 3900, 3700, 3600, 4350, 3000, 3800, 3600…
$ geometry                <POINT [m]> POINT (29463.95 45634.94), POINT (15786.…
$ region                  <chr> "NORTH REGION", "WEST REGION", "NORTH-EAST REG…
$ no_of_kindergarten_500m <dbl> 1, 1, 0, 1, 1, 6, 3, 1, 1, 1, 1, 0, 2, 4, 0, 0…
$ prox_kindergarten       <dbl> 3.717727e+02, 1.241314e+02, 6.531547e+02, 4.47…
$ no_of_childcare_500m    <dbl> 10, 9, 4, 3, 6, 11, 9, 9, 7, 6, 4, 1, 8, 10, 2…
$ prox_childcare          <dbl> 6.318731e+01, 7.642944e+01, 8.264710e+01, 4.47…
$ no_of_hawker_500m       <dbl> 1, 0, 0, 1, 2, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0, 0…
$ prox_hawker             <dbl> 478.4537, 840.4254, 660.6058, 332.1417, 354.77…
$ no_of_busstop_500m      <dbl> 17, 17, 6, 11, 13, 17, 15, 8, 16, 11, 5, 6, 17…
$ prox_busstop            <dbl> 174.00119, 80.37739, 70.48567, 161.93848, 280.…
$ no_of_shoppingmall_1km  <dbl> 1, 1, 1, 2, 0, 3, 1, 2, 1, 2, 1, 0, 3, 4, 0, 1…
$ prox_shoppingmall       <dbl> 704.70468, 905.82230, 735.18898, 565.82028, 10…
$ prox_mrt                <dbl> 1259.97262, 1869.01210, 197.18773, 175.98753, …
$ prox_prisch             <dbl> 271.15943, 1353.83517, 86.23193, 234.73109, 57…
$ prox_cbd                <dbl> 15605.314, 14916.316, 12419.101, 2871.311, 103…

4. Overview Of Dataset

Below is an overview of what our dataset currently consists of.

colnames(rental_sf)
 [1] "rent_approval_date"      "town"                   
 [3] "flat_type"               "monthly_rent"           
 [5] "geometry"                "region"                 
 [7] "no_of_kindergarten_500m" "prox_kindergarten"      
 [9] "no_of_childcare_500m"    "prox_childcare"         
[11] "no_of_hawker_500m"       "prox_hawker"            
[13] "no_of_busstop_500m"      "prox_busstop"           
[15] "no_of_shoppingmall_1km"  "prox_shoppingmall"      
[17] "prox_mrt"                "prox_prisch"            
[19] "prox_cbd"               

Dependent Variable:

  • Monthly Rental: monthly_rent

Explanatory Variables:

Continuous

  • prox_* (distance to the nearest): kindergarten, childcare, hawker centre, bus stop, shopping mall, MRT station, primary school, CBD

  • no_of_* (count within a given radius): no_of_kindergarten_500m, no_of_childcare_500m, no_of_hawker_500m, no_of_busstop_500m, no_of_shoppingmall_1km

Categorical

  • Flat Type: flat_type

  • Town: town

  • Region: region
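To illustrate how this dependent/explanatory split could feed into the team's later regression work, here is a hypothetical sketch (variable names taken from the dataset above) that collects the continuous explanatory variables into a model formula using base R's `reformulate()`:

```r
# Continuous explanatory variables listed above (names from rental_sf).
explanatory <- c(
  "prox_kindergarten", "prox_childcare", "prox_hawker", "prox_busstop",
  "prox_shoppingmall", "prox_mrt", "prox_prisch", "prox_cbd",
  "no_of_kindergarten_500m", "no_of_childcare_500m", "no_of_hawker_500m",
  "no_of_busstop_500m", "no_of_shoppingmall_1km"
)

# reformulate() builds: monthly_rent ~ prox_kindergarten + prox_childcare + ...
rent_formula <- reformulate(explanatory, response = "monthly_rent")
rent_formula
```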

5. Shiny Storyboard (EDA)

Since I am in charge of the exploratory data analysis (EDA) sections of my group's Shiny app, here is a brief overview of the storyboard for the visualisations we will be building. This is not final and serves only as a guideline for how our app will look.

5.1 Histogram

5.2 Choropleth Map

5.3 Scatterplot

5.4 Locations of Interest

6. Histograms

Continuous

| Options | Values |
|---------|--------|
| Filter | month_year, flat_type, town |
| Choose Variables for Plotting | monthly_rent, prox_hawker, etc. |

Categorical

| Options | Values |
|---------|--------|
| X variables | flat_type, region |
| Y variables | frequency, median_rent |

6.1 Categorical Variables

6.1.1 Plotting by count of data

# Create a summary of counts for each flat type
count_data <- rental_sf %>%
  group_by(flat_type) %>%
  summarise(count = n())

# Create the bar plot with labels
ggplot(count_data, aes(x = flat_type, y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") + 
  geom_text(aes(label = count), vjust = -0.5, size = 4) +  # Add labels on top of the bars
  labs(title = "Count of Flat Type",
       x = "Flat Type",
       y = "Count") +
  theme_minimal()

6.1.2 Plotting By Median Rental Price

# Get Median Rent Data
median_rent_data <- rental_sf %>%
  group_by(flat_type) %>%
  summarise(median_rent = median(monthly_rent, na.rm = TRUE),
        .groups = 'drop'
  )

# Create the bar plot with labels
ggplot(median_rent_data, aes(x = flat_type, y = median_rent)) +
  geom_bar(stat = "identity", fill = "steelblue") + 
  geom_text(aes(label = median_rent), vjust = -0.5, size = 4) +  # Add labels on top of the bars
  labs(title = "Median Rent of Different Flat Type",
       x = "Flat Type",
       y = "Median Rent") +
  theme_minimal()

6.2 Continuous Variables

selected_month = "2024 Jan"
selected_flat_type = '3-ROOM'
selected_town = 'ANG MO KIO'
selected_plot_variable = 'monthly_rent'
bin_number = 10
# Filter based on params
histo_filtered_rental_sf <- rental_sf %>%
      mutate(year_month = format(rent_approval_date, "%Y %b")) %>%
      filter (if (selected_month == "2024 Jan to Sep") {
        TRUE  # No additional filtering by month
      } else {
        format(rent_approval_date, "%Y %b") == selected_month  # Filter by selected year-month
      }) %>%
      filter(flat_type == selected_flat_type) %>%
      filter(town == selected_town)

# Create the histogram with labels
ggplot(histo_filtered_rental_sf,
                  aes(x = .data[[selected_plot_variable]])) +  # .data pronoun replaces the deprecated aes_string()
        geom_histogram(
          bins = bin_number,
          fill = "blue",
          color = "black"
        ) +  
        labs(
          title = paste("Histogram of", selected_plot_variable),
          x = selected_plot_variable,
          y = "Frequency"
        )
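The month filter above hinges on `format()` converting a Date into a "YYYY Mon" label that can be compared against the dropdown selection. A minimal, self-contained sketch of the pattern (note that %b abbreviations are locale-dependent; an English/C locale is assumed here):

```r
# Assume an English/C locale so %b yields "Jan", "Feb", ...
Sys.setlocale("LC_TIME", "C")

dates  <- as.Date(c("2024-01-01", "2024-02-15", "2024-01-20"))
labels <- format(dates, "%Y %b")   # "2024 Jan" "2024 Feb" "2024 Jan"

selected_month <- "2024 Jan"
dates[labels == selected_month]    # keeps only the January records
#> [1] "2024-01-01" "2024-01-20"
```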

7. Choropleth Mapping

| Options | Values |
|---------|--------|
| Filter | month_year and flat_type |
| Plot different variables | Median Monthly Rent, Rented Flat Count |
Note: As Kallang/Whampoa does not match any town in mpsz_sf, we map it to Kallang. For the town 'Central', since we cannot deduce which planning area it refers to and it does not match any value in mpsz_sf, we remove it from the choropleth mapping.

7.1 Example for Rented Flat Count

selected_month = "2024 Jan"
selected_flat_type = '3-ROOM'
# Dealing with Kallang/Whampoa & Central town which is not present in mpsz
# Filter based on month and flat type
ch_filtered_rental_sf <- rental_sf %>%
  mutate(town = if_else(town == 'KALLANG/WHAMPOA', 'KALLANG', town)) %>%   
  filter(town != "CENTRAL") %>% 
  filter (if (selected_month == "2024 Jan to Sep") {
        TRUE  # No additional filtering by month
      } else {
        format(rent_approval_date, "%Y %b") == selected_month  # Filter by selected year-month
      }) %>%
      filter(flat_type == selected_flat_type)
  
# Create choropleth_data
choropleth_data <- ch_filtered_rental_sf  %>%
  group_by(town) %>%
  summarize(rented_count = n(), .groups = 'drop') %>%
  st_drop_geometry() %>%
  right_join(combined_town_mpsz_sf, 
            by = c("town" = "PLN_AREA_N")) %>%
  st_as_sf()  # Convert to sf object after joining
tmap_mode("view")
qtm(choropleth_data, 
    fill = "rented_count")

7.2 Example of Median Monthly Rent

selected_month = "2024 Jan"
selected_flat_type = '3-ROOM'
# Dealing with Kallang/Whampoa & Central town which is not present in mpsz
# Filter based on month and flat type
ch_filtered_rental_sf <- rental_sf %>%
  mutate(town = if_else(town == 'KALLANG/WHAMPOA', 'KALLANG', town)) %>%
  filter(town != "CENTRAL") %>%
  filter (if (selected_month == "2024 Jan to Sep") {
    TRUE  # No additional filtering by month
  } else {
    format(rent_approval_date, "%Y %b") == selected_month  # Filter by selected year-month
  }) %>%
  filter(flat_type == selected_flat_type)

# Create choropleth_data
choropleth_data <- ch_filtered_rental_sf  %>%
  group_by(town) %>%
  summarize(median_rent = median(monthly_rent, na.rm = TRUE),
            .groups = 'drop') %>%
  st_drop_geometry() %>%
  right_join(combined_town_mpsz_sf, by = c("town" = "PLN_AREA_N")) %>%
  st_as_sf()  # Convert to sf object after joining
tmap_mode("view")
qtm(choropleth_data, 
    fill = "median_rent")

8. Scatterplot

Here, we generate scatterplots for any pair of continuous variables the user chooses. As in the previous sections, we also include filter options so users can subset the data.

| Options | Values |
|---------|--------|
| X variables | any continuous, e.g. monthly_rent, prox_mrt |
| Y variables | any continuous, e.g. prox_hawker, prox_prisch |
| Filter | month_year, flat_type, town |

selected_month = "2024 Jan"
selected_flat_type = '3-ROOM'
selected_town = 'YISHUN'
scatter_filtered_data <- rental_sf %>%
      mutate(year_month = format(rent_approval_date, "%Y %b")) %>%
      filter (if (selected_month == "2024 Jan to Sep") {
        TRUE  # No additional filtering by month
      } else {
        format(rent_approval_date, "%Y %b") == selected_month  # Filter by selected year-month
      }) %>%
      filter(flat_type == selected_flat_type)%>%
      filter(if (selected_town == 'all') TRUE else town == selected_town)
ggplot(scatter_filtered_data,
       aes(x = monthly_rent, y = prox_hawker)) +  # Refer to columns directly
  geom_point(color = 'blue', size = 1) +
  labs(
    title = paste("Scatterplot of Monthly Rent vs. Proximity to Hawker"),
    x = "Monthly Rent",
    y = "Proximity to Hawker"
  ) +
  theme_minimal()
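The visual relationship in a scatterplot like this can also be quantified with `cor()`. A toy sketch on simulated data (not the actual rental dataset) illustrates the idea:

```r
set.seed(42)
# Simulated data: proximity decreases as rent increases, plus noise.
rent <- rnorm(100, mean = 3500, sd = 300)
prox <- 2000 - 0.4 * rent + rnorm(100, sd = 100)

# Pearson correlation: a value near -1 indicates a strong negative
# linear relationship between the two variables.
cor(rent, prox)
```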

9. Visualising Locations of Interest

| Options | Values |
|---------|--------|
| Filter by town | town |
| Geospatial data to plot | All geospatial datasets, including Kindergarten, Primary School, etc. |

Here is an example of how the user can visualise the locations of interest by selecting a town and the categories of locations (such as bus stops, hawker centres, etc.).

The below example is for Bus Stops in Yishun.

# Filter for town boundaries in mpsz_sf
specific_town_sf <- mpsz_sf %>% filter(PLN_AREA_N == 'YISHUN')

# Performing a spatial intersection
filtered_busstops <- st_intersection(busstop_sf, specific_town_sf)

# Switch to interactive map mode
tmap_mode('view')

# Plot bus stops within Yishun
tm_shape(filtered_busstops) +
  tm_dots(col = 'red')

10. Final Result

Below are screenshots of the final results. While most of the inputs are dropdowns, I also incorporated other input types, such as radio buttons and sliders, to increase the variety.

10.1 Histogram

With a histogram, the users can see the frequency and distribution of their selected variables.

10.1.1 Continuous

10.1.2 Categorical

10.2 Choropleth Map

The choropleth map, on the other hand, gives them a quick understanding of the spatial distribution of median monthly rent and rented flat counts across the various towns in Singapore.

10.3 Scatterplot

With the scatterplot, users can examine the relationship between any pair of continuous variables and see how one correlates with another.

10.4 Locations of Interest

With this map, users can see the locations of interest that we used to calculate values such as the proximity to the nearest facility and the count of facilities within a given radius.

11. Conclusion

For my part, I tried to include filters wherever possible so that users can narrow down the data and find insights tailored to their needs. By allowing users to filter on criteria such as flat type, month, and town, they can focus on the information most relevant to them. This makes data exploration more intuitive and efficient, helping users uncover trends and patterns directly applicable to their own interests and decision-making.